I made notes outlining the useful features of plotly here.
From that list I think some worth exploring include:
I’m going to combine these to create a cool visualization (hopefully without too much effort).
I want to use something non-controversial that people generally just tend to find interesting. I was initially thinking savings vs expenditures by state or country as a picture of household financial health but access to the data is not as straightforward as I’d expected and economics are so overdone. So, I’ll go with baby names in the US.
I’ll create 2 plots as follows:
Thank you US Social Security Administration! The data comes zipped as .txt files by year (for National data) or by state, with 1 file per year or state respectively. I’m going to have to combine this data into 2 dataframes (instead of a ton of files). The data in files are separated by commas.
download.file("https://www.ssa.gov/OACT/babynames/names.zip",
destfile = "./plotly_data/names.zip") # national data
download.file("https://www.ssa.gov/OACT/babynames/state/namesbystate.zip",
destfile = "./plotly_data/namesbystate.zip") # state data
# time of download
Sys.time() %>%
format(., format = "%e %b %Y at %H:%M:%S", usetz = TRUE) %>%
saveRDS(., file = "./plotly_data/dl_timestamp.Rdata")
# unzip files to tempdir & list
td <- tempdir()
unzip("./plotly_data/names.zip", exdir = td)
unzip("./plotly_data/namesbystate.zip", exdir = td)
US_txt <- list.files(td, "^yob.*\\.txt$", full.names = TRUE)
state_txt <- list.files(td, "^[A-Z]{2}\\.TXT$", full.names = TRUE)
# combine into 2 dataframes
US_baby <- map(US_txt,
~ read_csv(.x,
col_names = c("Name", "Sex", "n"),
col_types = "cci"
) %>%
mutate(., Year = as.integer(str_extract(.x, "[0-9]{4}"))) # add year
) %>%
reduce(rbind)
state_baby <- map(state_txt,
~ read_csv(.x,
col_names = c("State", "Sex", "Year", "Name", "n"),
col_types = "ccici"
)
) %>%
reduce(rbind)
US_no1 <- US_baby %>%
group_by(., Sex, Year) %>%
filter(., n == max(n)) %>%
add_count(., Name)
US_no1
yrs_no1 <- ungroup(US_no1) %>%
count(., Name, Sex, sort = TRUE)
yrs_no1
US_2017 <- filter(US_baby, Year == 2017) %>%
arrange(., desc(n)) %>%
mutate(., US_rank = row_number())
US_2017
state_last <- filter(state_baby, Year == max(Year)) %>%
group_by(., State, Sex) %>%
filter(., n == max(n)) %>%
add_count(., sort = TRUE) %>%
left_join(., select(US_2017, Name, Sex, US_rank), by = c("Name", "Sex"))
state_no1 <- ungroup(state_last) %>%
count(., Name, Sex) %>%
rename(., num_states = nnn)
state_no1
state_comb <- left_join(state_last, state_no1, by = c("Name", "Sex")) %>%
mutate(., # combine info for ties
Name2 = if_else(
lead(paste(State, Sex, Year, n), 1) == paste(State, Sex, Year, n),
paste(Name, lead(Name, 1), sep = " / "),
NA_character_,
missing = Name
),
US_rank2 = if_else(
lead(paste(State, Sex, Year, n), 1) == paste(State, Sex, Year, n),
paste(US_rank, lead(US_rank, 1), sep = " / "),
NA_character_,
missing = as.character(US_rank)
),
num_states2 = if_else(
paste(State, Sex, Year, n) == lead(paste(State, Sex, Year, n), 1),
paste(num_states, lead(num_states, 1), sep = " / "),
NA_character_,
missing = as.character(num_states)
)
)
# remove second line for ties, add hover text
state_final <- filter(state_comb, nn == 1 | (nn == 2 & grepl("/", Name2))) %>%
mutate(.,
hover = paste(
if_else(nn > 1, "TIE - ", ""), Name2,
"<br>Total births = ", n,
"<br>US rank = ", US_rank2,
"<br>No. of states where #1 = ", num_states2,
sep = ""
)
)
state_final <- ungroup(state_final) %>%
mutate(., Sex = recode(Sex, "M" = "Boy", "F" = "Girl")) %>%
left_join(., as_tibble(cbind(state.abb, state.name)), by = c("State" = "state.abb"))
state_final
Got the transform and button info from: https://community.plot.ly/t/need-help-on-using-dropdown-to-filter/6596/2
Color info from: https://moderndata.plot.ly/create-colorful-graphs-in-r-with-rcolorbrewer-and-plotly/
This choropleth works and the button changes the data like I want but I can’t figure out how to get the data change AND the color change.
# specify some map projection/options
ch_opts <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = FALSE
)
p <- plot_geo(state_final, locationmode = 'USA-states') %>%
add_trace(., type = "choropleth",
z = ~US_rank, text = ~hover, locations = ~State,
color = ~US_rank, colors = "PuRd",
transforms = list(
list(
type = "filter",
target = ~Sex,
operation = "=",
value = c("Boy", "Girl"),
z = ~US_rank
)
)
) %>%
colorbar(title = "U.S. Rank") %>%
layout(
title = '2017 Most Popular Baby Names in US by State<br>(Hover for breakdown)',
geo = ch_opts,
updatemenus = list(
list(
type = 'buttons',
active = 0,
buttons = list(
list(method = "restyle",
args = list("transforms[0].value", "Boy"),
label = "Boy"),
list(method = "restyle",
args = list("transforms[0].value", "Girl"),
label = "Girl")
)
)
)
)
p
The documentation on plotly for R is just too poor. There are a handful of examples on how to enter something but none of the explanation about why a certain approach is used (and multiple instances in the documentation that say “all the options follow below” without any options.
Time to ask the Stack Overflow community. Creating a minimal, reproducible example.
# prep data for SO
df <- select(state_final, code = State, sex = Sex, us_rank = US_rank) %>%
mutate(., sex = recode(sex, "Boy" = "M", "Girl" = "F"))
dput(df)
# put everything below this on SO
df <- structure(list(code = c("AK", "WI", "WY", "AK", "AL", "AL", "AR",
"AR", "AZ", "AZ", "CA", "CA", "CO", "CO", "CT", "CT", "DC", "DC",
"DE", "DE", "FL", "FL", "GA", "GA", "HI", "HI", "IA", "IA", "ID",
"ID", "IL", "IL", "IN", "IN", "KS", "KS", "KY", "KY", "LA", "LA",
"MA", "MA", "MD", "MD", "ME", "ME", "MI", "MI", "MN", "MN", "MO",
"MO", "MS", "MS", "MT", "MT", "NC", "NC", "ND", "ND", "NE", "NE",
"NH", "NH", "NJ", "NJ", "NM", "NM", "NV", "NV", "NY", "NY", "OH",
"OH", "OK", "OK", "OR", "OR", "PA", "PA", "RI", "RI", "SC", "SC",
"SD", "SD", "TN", "TN", "TX", "TX", "UT", "UT", "VA", "VA", "VT",
"VT", "WA", "WA", "WI", "WV", "WV", "WY"), sex = c("F", "F",
"M", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F",
"M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M",
"F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F",
"M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M",
"F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F",
"M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M",
"F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F",
"M", "F", "M", "F", "M", "M", "F", "M", "F"), us_rank = c(1L,
1L, 2L, 9L, 5L, 7L, 1L, 14L, 1L, 2L, 1L, 4L, 1L, 2L, 3L, 4L,
5L, 9L, 3L, 10L, 6L, 2L, 5L, 7L, 1L, 2L, 28L, 15L, 1L, 15L, 3L,
4L, 1L, 15L, 1L, 15L, 1L, 7L, 3L, 2L, 1L, 11L, 5L, 2L, 18L, 15L,
1L, 2L, 3L, 15L, 3L, 7L, 5L, 7L, 3L, 9L, 5L, 7L, 1L, 15L, 1L,
15L, 18L, 10L, 1L, 2L, 13L, 4L, 1L, 2L, 3L, 2L, 1L, 2L, 1L, 7L,
1L, 15L, 1L, 2L, 18L, 17L, 5L, 7L, 1L, 15L, 5L, 7L, 1L, 4L, 3L,
15L, 3L, 2L, 26L, 37L, 3L, 2L, 29L, 1L, 2L, 1L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -102L))
library(magrittr)
library(plotly)
geo_opts <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = FALSE
)
plot_geo(df, locationmode = 'USA-states') %>%
add_trace(., type = "choropleth",
z = ~us_rank, locations = ~code, color = ~us_rank,
transforms = list(
list(
type = "filter",
target = ~sex,
operation = "=",
value = c("M", "F")
)
)
) %>%
layout(
geo = geo_opts,
updatemenus = list(
list(
type = 'buttons',
active = 0,
buttons = list(
list(method = "restyle",
args = list("transforms[0].value", "M"),
label = "Boy"),
list(method = "restyle",
args = list("transforms[0].value", "F"),
label = "Girl")
)
)
)
)
Notes about Layout():
title = call in layout()layout() after calling subplot()boy <- filter(state_final, Sex == "Boy")
girl <- filter(state_final, Sex == "Girl")
# specify some map projection/options
ch_opts <- list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = FALSE
)
boy_ch <- plot_geo(boy, locationmode = 'USA-states') %>%
add_trace(., type = "choropleth",
z = ~US_rank, text = ~hover, locations = ~State,
zauto = FALSE, zmin = 1, zmax = 30,
color = ~US_rank, colors = "GnBu", reversescale = TRUE
) %>%
colorbar(title = "U.S. Rank", y = -1) %>%
layout(., geo = ch_opts)
girl_ch <- plot_geo(girl, locationmode = 'USA-states') %>%
add_trace(., type = "choropleth",
z = ~US_rank, text = ~hover, locations = ~State,
zauto = FALSE, zmin = 1, zmax = 30,
color = ~US_rank, colors = "PuRd", reversescale = TRUE
) %>%
colorbar(., title = "U.S. Rank", x = 0.45) %>%
layout(., geo = ch_opts)
subplot(girl_ch, boy_ch) %>%
layout(annotations = list(
list(
text = "2017 Girl Names",
xref = "paper", xanchor = "center", x = 0.25,
yref = "paper", yanchor = "bottom", y = 0.1,
align = "center",
showarrow = FALSE
),
list(
text = "2017 Boy Names",
xref = "paper", xanchor = "center", x = 0.75,
yref = "paper", yanchor = "bottom", y = 0.1,
align = "center",
showarrow = FALSE
)
)
)
I’m going to leave this where it is for now. It’s definitely not what I was hoping for but the documentation is rough.
top5 <- group_by(US_baby, Sex, Year) %>%
top_n(., n = 5, wt = n) %>%
arrange(., desc(n)) %>%
mutate(., rank = row_number())
top5_50y <- ungroup(top5) %>%
filter(., Year >= (max(Year) - 50)) %>%
mutate(., Sex = recode(Sex, "M" = "Boy", "F" = "Girl"))
arrange(top5_50y, Year, Sex)
unique(top5_50y$Year) %>% sort()
## [1] 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## [15] 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## [29] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008
## [43] 2009 2010 2011 2012 2013 2014 2015 2016 2017
girl5 <- filter(top5_50y, Sex == "Girl")
boy5 <- filter(top5_50y, Sex == "Boy")
girl_n <- n_distinct(girl5$Name)
boy_n <- n_distinct(boy5$Name)
col_palette <- colorRampPalette(brewer.pal(9, "Set1"))
n_distinct(top5$Name) # count total unique names (need a color palette with enough colors)
## [1] 72
girl_g <- ggplot(data = girl5, aes(x = Year, y = rank)) +
geom_line(aes(color = Name)) +
scale_y_reverse()
boy_g <- ggplot(data = boy5, aes(x = Year, y = rank)) +
geom_line(aes(color = Name)) +
scale_y_reverse()
col_palette <- colorRampPalette(brewer.pal(9, "Set1"))
subplot(girl_g, boy_g)
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
# boy/girl split by facet (too many names)
ggplot(data = top5, aes(x = Year, y = rank)) +
geom_point(aes(color = Name)) +
facet_grid(Sex ~ .)
plot_ly(data = top5, x = ~Year, y = ~rank,
color = ~Name,
type = "scatter", mode = "lines")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
# doesn't work with transforms...
us_p <- plot_ly(data = top5, x = ~Year, y = ~rank,
color = ~Name, colors = col_palette(n_distinct(top5_50y$Name)),
type = "scatter", mode = "lines",
transforms = list(
list(
type = 'filter',
target = 'Sex',
operation = '=',
value = c("Boy", "Girl")
)
)
)
us_p %>%
layout(.,
updatemenus = list(
list(
type = 'buttons',
active = 0,
buttons = list(
list(method = "restyle",
args = list("transforms[0].value", "Boy"),
label = "Boy"),
list(method = "restyle",
args = list("transforms[0].value", "Girl"),
label = "Girl")
)
)
)
)
Make lines appear by using simpler case (works!)
col_palette <- colorRampPalette(brewer.pal(9, "PuRd")[5:9])
# testing out package `hues` (clustering to get n distinct colors)
library(hues)
col_palette(girl_n)
## [1] "#DF65B0" "#E05BA9" "#E151A3" "#E2489D" "#E43E97" "#E53491" "#E62B8B"
## [8] "#E42683" "#E0227B" "#DC1E73" "#D81B6A" "#D41762" "#D0135A" "#C91054"
## [15] "#C10D51" "#B80A4E" "#AF074B" "#A70548" "#9E0245" "#960041" "#8E003B"
## [22] "#860036" "#7E0030" "#76002A" "#6E0024" "#67001F"
iwanthue(girl_n) %>% attributes()
## $names
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26"
# girls plot
arrange(girl5, Year) %>%
plot_ly(data = ., type = 'scatter', mode = 'lines+markers',
x = ~Year, y = ~rank, color = ~Name, colors = unname(iwanthue(girl_n))
) %>%
layout(.,
yaxis = list(
tickmode = 'linear', tick0 = 1, dtick = 1,
autorange = "reversed",
title = "Rank"
),
xaxis = list(
title = "Year"
)
)
# boys plot
arrange(boy5, Year) %>%
plot_ly(data = ., type = 'scatter', mode = 'lines+markers', connectgaps = FALSE,
x = ~Year, y = ~rank, color = ~Name, colors = unname(iwanthue(boy_n))
) %>%
layout(.,
yaxis = list(
tickmode = 'linear', tick0 = 1, dtick = 1,
autorange = "reversed",
title = "Rank"
),
xaxis = list(
title = "Year"
)
)
# just discovered that lines connect without gaps --> need NA values to create gaps
# in plotly
How do I add the missing years with NA values in rank? (This is something I should be able to do)
girl5
filter(girl5, Name == "Abigail")
# years 1967 - 2017
t <- spread(girl5, key = Name, value = rank) %>%
gather(., key = "Name", value = "rank", -Sex, -n, -Year, na.rm = FALSE)
t
count(t, Name)
This makes n incorrect. A better way to do this might be to go back to the original data, identify the top 5 names each year, filter every name, then either:
tmp_50yr <- mutate(US_baby, Sex = recode(Sex, "F" = "Girl", "M" = "Boy")) %>%
filter(., Year >= max(Year) - 50) %>%
group_by(., Sex, Year) %>%
arrange(., desc(n)) %>%
mutate(., Rank = row_number())
top_girls <- filter(tmp_50yr, Rank <= 5 & Sex == "Girl") %>%
{unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too
top_boys <- filter(tmp_50yr, Rank <= 5 & Sex == "Boy") %>%
{unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too
girl5 <- filter(tmp_50yr, Name %in% top_girls & Sex == "Girl") %>%
arrange(., Name, Year)
boy5 <- filter(tmp_50yr, Name %in% top_boys & Sex == "Boy") %>%
arrange(., Year)
girl_n <- n_distinct(girl5$Name)
boy_n <- n_distinct(boy5$Name)
plot_ly(data = girl5, type = 'scatter', mode = 'lines+markers',
x = ~Year, y = ~Rank, color = ~Name, colors = unname(iwanthue(girl_n))
) %>%
layout(., yaxis = list(autorange = "reversed"))
filter(girl5, Name == "Madison")
Well, that worked but I had assumed that the top 5 names for any given year would always be in the top 1000 and therefore have a rank for each year. That is not the case for all names (example: Madison).
Well, I could go back to the spread/gather approach and before spreading paste n and Rank together so they are spread correctly and then split them after gathering. I could afterward replace NA values in Rank with >1000 but I don’t think I’ll do that. I think I’ll just leave it NA.
glimpse(girl5)
## Observations: 1,314
## Variables: 5
## $ Name <chr> "Abigail", "Abigail", "Abigail", "Abigail", "Abigail", "A...
## $ Sex <chr> "Girl", "Girl", "Girl", "Girl", "Girl", "Girl", "Girl", "...
## $ n <int> 245, 263, 280, 362, 392, 394, 479, 605, 614, 829, 796, 10...
## $ Year <int> 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 197...
## $ Rank <int> 605, 592, 591, 526, 491, 468, 413, 357, 354, 278, 297, 24...
girl5_2 <- unite(girl5, col = n_Rank, n, Rank, sep = "_") %>%
spread(., key = Year, value = n_Rank) %>%
gather(., key = "Year", value = "n_Rank", -Name, -Sex, na.rm = FALSE) %>%
separate(., col = n_Rank, into = c("n", "Rank"), sep = "_", remove = TRUE, convert = TRUE)
plot_ly(data = girl5_2, type = 'scatter', mode = 'lines+markers',
x = ~Year, y = ~Rank, color = ~Name, colors = unname(iwanthue(girl_n))
) %>%
layout(., yaxis = list(autorange = "reversed"))
Just too busy, will scale down to the top 2 names each year
top2_50yr <- mutate(US_baby, Sex = recode(Sex, "F" = "Girl", "M" = "Boy")) %>%
filter(., Year >= max(Year) - 50) %>%
group_by(., Sex, Year) %>%
arrange(., desc(n)) %>%
mutate(., Rank = min_rank(-n))
top2_girls <- filter(top2_50yr, Rank <= 2 & Sex == "Girl") %>%
{unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too
girl2 <- filter(top2_50yr, Name %in% top2_girls & Sex == "Girl") %>%
arrange(., Name, Year) %>%
unite(., col = n_Rank, n, Rank, sep = "_") %>%
spread(., key = Year, value = n_Rank) %>%
gather(., key = "Year", value = "n_Rank", -Name, -Sex, na.rm = FALSE) %>%
separate(., col = n_Rank, into = c("n", "Rank"), sep = "_",
remove = TRUE, convert = TRUE)
top2_boys <- filter(top2_50yr, Rank <= 2 & Sex == "Boy") %>%
{unique(.$Name)} # some unfortunate girls were given boy names => filter by sex too
boy2 <- filter(top2_50yr, Name %in% top2_boys & Sex == "Boy") %>%
arrange(., Year) %>%
unite(., col = n_Rank, n, Rank, sep = "_") %>%
spread(., key = Year, value = n_Rank) %>%
gather(., key = "Year", value = "n_Rank", -Name, -Sex, na.rm = FALSE) %>%
separate(., col = n_Rank, into = c("n", "Rank"), sep = "_",
remove = TRUE, convert = TRUE)
plot_ly(data = girl2, type = 'scatter', mode = 'lines+markers',
x = ~Year, y = ~Rank, color = ~Name, colors = unname(iwanthue(girl_n))
) %>%
layout(., yaxis = list(range = c(5.5, 0.5)))
These should match if the calculations were right… some values are being dropped. WHY?
filter(US_baby, Name == "Madison", Year == 1977)
filter(girl2, Name == "Madison", Year == 1977)
I thought it was because input was tmp_50yr instead of top2_50yr but that did not fix the problem. Examine each dataframe to see when data is dropped
filter(US_baby, Name == "Madison", Year == 1977) # this is taking into account boys names
filter(top2_50yr, Name == "Madison", Year == 1977) # still here
filter(girl2, Name == "Madison", Year == 1977) # missing
So, it’s lost in the creation of girl2. Where? Is top2_girls correct?
filter(top2_50yr, Sex == "Girl", Rank <= 2) %>%
arrange(., Year, Rank) %>%
.$Name %in% top2_girls %>%
all()
## [1] TRUE
Appears to be. Now for girl2.
# is each name-sex combination present for 1967-2017 in top2_50yr?
all_yrs <- unique(top2_50yr$Year) %>%
sort(.)
girl2 <- filter(top2_50yr, Name %in% top2_girls & Sex == "Girl")
group_by(girl2, Name) %>%
summarize(., sum(Year %in% all_yrs, na.rm = TRUE))
Apparently only ‘Madison’ is missing. Why? Is it present in the original data?
filter(US_baby, Name == "Madison" & Sex == "F") %>%
arrange(., Year)
Nope. Okay, so there are NO ERRORS… that’s just the way the data is. I guess ‘Madison’ was a very unpopular name for girls before 1984 and didn’t really even exist before 1970. Just out of curiosity… what about boys named ‘Madison’? and compared to girls?
madison <- filter(US_baby, Name == 'Madison') %>%
arrange(., Year)
# boys name popularity
filter(madison, Sex == 'M') %>%
ggplot(., aes(x = Year, y = n)) +
geom_line()
# boys vs girls popularity
ggplot(madison, aes(x = Year, y = n, color = Sex)) +
geom_line()
Similar trend but the name never really caught on for boys… probably because people overwhelmingly preferred it for girls.